Reference-based compositional analysis

Jeferyd Yepes G.

May 2025

📝 Objective: Carry out a downstream analysis using the Kraken2/Bracken output to report important insights in terms of differences among samples.

About the data

The dataset we will be using for this part of the workshop was collected in Cuatro Ciénegas (México), an oasis in the Mexican desert whose environmental conditions are often linked to the ones present in ancient seas, due to a higher-than-average content of sulfur and magnesium but a lower concentrations of phosphorus and other nutrients. For more information about this study, please check Okie et al.,(2020); the BioProject accession number is PRJEB22811.

library(knitr)
options(conflicts.policy = list(warn = FALSE)) #this line removes all warnings of function masking

#change the PATH according to your settings
tab_sam = read.csv("/Users/yepesgar/Downloads/SIB_IMDAMM/Day1_pm/script/data/test_sheet.csv")
kable(tab_sam)
Name Group Replicate
ERR2143758 Cont 1
ERR2143759 Cont 2
ERR2143760 Cont 3
ERR2143771 Unenriched 1
ERR2143772 Unenriched 2
ERR2143774 Unenriched 3
ERR2143769 Fertilized 1
ERR2143770 Fertilized 2
ERR2143773 Fertilized 3

The pipeline

For this course, we propose to wrap with Nextflow the protocol published by Jennifer Lu et al. (2022). The workflow is designed as follows:

As you can see from the picture, the input is FASTQ files from one or multiple samples, then it performs a host removal with Bowtie2 by aligning the reads against an indexed reference genome to carry out the taxonomic classification with Kraken2 afterwards. Taking the output from Kraken2, a Bayesian species abundance re-estimation is achieved with Bracken, and Krona plots are generated using the Bracken output to visualize interactively the relative abundance of each annotated species. If multiple samples are used as input to the pipeline, the bracken reports will be concatenated and converted into a Biological Observation Matrix (BIOM) file. Finally, the BIOM file will be first converted in a Phyloseq object, and this object will be further processed to generate absolute plots, estimate both α and β-diversity and perform a network analysis; this information will be presented in a final report.html.

Now, we are going to move to proper directory to run the pipeline, you should be in the directory nf4-science. Copy and paste and the following commands:

Bash

mv data/metagenomics ./
cd metagenomics

We are ready to run the pipeline then, with the command:

Bash

nextflow run main.nf --sheet_csv 'data/samplesheet.csv'

Once it has finished, go back to the working directory

Bash

cd .. 

Krona Plots

As mentioned before, this pipeline will provide us with a set of files that are useful for our primary goal related to exploratory data analysis. However, as we do not have enough computational power and storage capacity in GitHub Codespaces, we have executed the pipeline with the real data and databases beforehand. Below you can find the Krona plots depicting the species abundance per sample.

As a reminder, Krona plots are interactive, multi-layered pie charts used to visualize hierarchic data derived from taxonomic annotation of the sequences.

Pavian

Another way to explore the species abundance across samples, and more importantly compare them is to use Pavian. Pavian is an interactive, web-based application designed to analyze, explore, and visualize metagenomic classification results, particularly from tools like Kraken2, Centrifuge, and MetaPhlAn.

To use this web server you will need just to upload the Kraken/Bracken reports (.k2report) found at intermediate/kreports/ (Download and extract the folder intermediate.tar.gz from the Moodle page).

In case Pavian web server is not working properly, we have prepared the report for you, you can find it at intermediate/Pavian

Questions

Now, perform a visual inspection of the results, and consider the following questions:

Question: Do you see anything unusual on it? what would be the reasons causing this problem? what do you propose to solve it?

Question : What do you think of the percentages considering the type of samples we are analyzing?

Question: Do you observe any difference among samples/replicates, at what taxonomic level?

Phyloseq object

Now, let’s create the Phyloseq object using the BIOM file generated by the pipeline and the experiment metadata. We are going to manipulate a bit the object to add the names of the taxonomic ranks, complete the species names and subset it to have only two conditions.

library(phyloseq)
library(ggplot2)
library(tidyr)

data_biom <- import_biom("/Users/yepesgar/Downloads/SIB_IMDAMM/Day1_pm/script/data/merged.biom")
data_biom@tax_table@.Data <- substring(data_biom@tax_table@.Data, 4)
colnames(data_biom@tax_table@.Data)<- c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")

tax_table_df <- as.data.frame(tax_table(data_biom))

# Modify the species column to include genus
tax_table_df$Species <- paste(tax_table_df$Genus, tax_table_df$Species, sep = " ")

# Update the taxonomy table in the phyloseq object
tax_table(data_biom) <- as.matrix(tax_table_df)

# Phyloseq object with only Bacteria
data_biom_bact <- subset_taxa(data_biom, Kingdom == "Bacteria")

samples_df <- tab_sam %>% 
  tibble::column_to_rownames("Name") 
samples = sample_data(samples_df)
otu_inf = data_biom_bact@otu_table
phy_inf = data_biom_bact@tax_table
merged <- phyloseq(otu_inf,phy_inf,samples)

two_sample_data <- subset_samples(merged, Group == "Cont" | Group == "Unenriched")
two_sample_data
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 4322 taxa and 6 samples ]
## sample_data() Sample Data:       [ 6 samples by 2 sample variables ]
## tax_table()   Taxonomy Table:    [ 4322 taxa by 7 taxonomic ranks ]

We already have created our Phyloseq object, and we are ready to move on with the analysis.

α-Diversity

Next, Let’s explore the α-diversity among samples using two indexes, Chao1 and Shannon:

p = plot_richness(two_sample_data, measures=c("Chao1", "Shannon"), color = "Group")+
    theme_bw()+
    theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   size = 12, hjust = 1),
        axis.text.y = element_text(size = 14),
        axis.title = element_text(size = 14),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        panel.border = element_blank(),
        strip.text.x = element_text(size = 14),
        legend.text = element_text(size = 10),
        legend.title = element_text(size = 14),
        axis.line = element_line(colour = "black"),
        )+
  labs(x = "Samples",
       color = "Group")
  
p + geom_point(size=5, alpha=0.7)

β-Diversity

β-Diversity measures how different two or more communities are, either in their composition (richness) or in the abundance of the organisms that compose it (abundance). To perform the analysis, we need to use dimensionality reduction techniques which enable comparisons in a space closer to our understanding (up to 3 dimensions). First, let’s create a heatmap that includes an ordination-based ordering of samples or taxa with a Bray-Curtis dissimilarity matrix as input for the ordination method (PCoA):

Heatmap

WARN: For visualization purposes, we filter the OTU table using a pre-specified criterion, and we are using agglomerated data.

library(phyloseq)
library(ggplot2)
library(tidyr)

phylum_data = tax_glom(two_sample_data, "Phylum")
threshold <- 500
phylum_data_filt <- filter_taxa(phylum_data, function(x) sum(x) > threshold, prune = TRUE)

phyloseq::plot_heatmap(phylum_data_filt, method = "PCoA", distance = "bray",
             taxa.label = "Phylum",
             trans=NULL, low="beige", high="red", na.value="beige")+    theme_bw()+
    theme(axis.text.x = element_text(angle = 45, vjust = 1, 
                                   size = 12, hjust = 1),
        axis.text.y = element_text(size = 14),
        axis.title = element_text(size = 14),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank(),
        strip.text.x = element_text(size = 14),
        legend.text = element_text(size = 10),
        legend.title = element_text(size = 14))

Ordination plots

Principal Coordinate Analysis (PCoA)

Ordination plots are obtained after the application of a method aiming to reduce the dimensionality of the data. As before, we are using a Bray-Curtis dissimilarity matrix as input for a PCoA to establish data structures that guide us towards the identification of differences among samples.

data.ord <- ordinate(two_sample_data, "PCoA", "bray")
plot_ordination(two_sample_data, data.ord, type="samples", color="Group",
                title="Principal Coordinate Analysis")+
    theme(axis.text = element_text(size = 14), axis.title = element_text(size = 14),
        legend.text = element_text(size = 14), strip.text.x = element_text(size = 14),
        plot.title = element_text(size = 16),
        legend.title = element_text(size=16),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_line(colour = "black"),
        panel.background = element_blank())+
  geom_point(size=5, alpha=0.7)

Network Analysis

Now, leveraging the Phyloseq object and its native functionalities, we can perform a simple network analysis (using Bray-Curtis matrix as input):

set.seed(42)
plot_net(phylum_data, distance = "bray", type = "taxa",
         maxdist = 0.9, color="Phylum")+
  guides(color = guide_legend(ncol = 3))

Exercises

Now it’s your turn to carry on with the analysis. For instance, you can subset the Phyloseq object to compare the fertilized ponds against the control, then:

α-Diversity: Do you observe any difference between samples terms of α-diversity? Try other indexes.

Below, there is a hint:

measures <- c("Observed", "Chao1", "ACE", "Shannon", "Simpson", "InvSimpson", "Fisher")

β-Diversity: Are the samples organized in an expected organization along the gradients (coordinates)? Does the PCoA capture well the variability of the data? Try different ordination methods to compare the results.

Below, there is a hint:

dist_methods <- unlist(distanceMethodList)

Normalization/transformation methods: As you may have noticed, we did not apply any normalization/transformation method to the data. Let’s try a centered-log ratio transformation.

For the centered-log ratio transformation, this is how it is achieved:

BiocManager::install("microbiome")
library(microbiome)
data_biom_clr <- microbiome::transform(phylum_data_filt, "clr")

Now, re-create the heatmap and the PCoA plots with the transformed data and using Euclidean as distance method. Evaluate if there are differences.

Optional

Different taxonomic ranks: We have performed the analysis at phylum level, try changing the taxonomic rank and try to track if the differentially abundant taxa belong to the same phyla identified with the preliminary analysis. Be prepared for really messy plots!

Bonus

Empty fields: Thr the following codeline and observe the results. There are species that do not account with genus, order or taxa level information. How is this possible? How do you propose to handle this? What are the implications of this situation?

summary(data_biom@tax_table@.Data== "")
##   Kingdom          Phylum          Class           Order        
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:4608      FALSE:4605      FALSE:4589      FALSE:4590     
##  TRUE :1         TRUE :4         TRUE :20        TRUE :19       
##    Family          Genus          Species       
##  Mode :logical   Mode :logical   Mode :logical  
##  FALSE:4571      FALSE:4606      FALSE:4609     
##  TRUE :38        TRUE :3

Additional resources

If you want to explore more in detail how you can expand your analysis, please check the following resources: